15 research outputs found

Using linguistic knowledge in SMT

Author: Zbib Rabih M. (Rabih Mohamed), 1974-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2010

Thesis (Ph. D. in Information Technology)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 153-162). In this thesis, we present methods for using linguistically motivated information to enhance the performance of statistical machine translation (SMT). One of the advantages of the statistical approach to machine translation is that it is largely language-agnostic. Machine learning models are used to automatically learn translation patterns from data. SMT can, however, be improved by using linguistic knowledge to address specific areas of the translation process, where translations would be hard to learn fully automatically. We present methods that use linguistic knowledge at various levels to improve statistical machine translation, focusing on Arabic-English translation as a case study. In the first part, morphological information is used to preprocess the Arabic text for Arabic-to-English and English-to-Arabic translation, which reduces the gap in the complexity of the morphology between Arabic and English. The second method addresses the issue of long-distance reordering in translation to account for the difference in the syntax of the two languages. In the third part, we show how additional local context information on the source side is incorporated, which helps reduce lexical ambiguity. Two methods are proposed for using binary decision trees to control the amount of context information introduced. These methods are successfully applied to the use of diacritized Arabic source in Arabic-to-English translation. The final method combines the outputs of an SMT system and a Rule-based MT (RBMT) system, taking advantage of the flexibility of the statistical approach and the rich linguistic knowledge embedded in the rule-based MT system. By Rabih M. Zbib. Ph.D. in Information Technology.

DSpace@MIT

Automated information distribution in low bandwidth environments

Author: Zbib Rabih M. (Rabih Mohamed), 1974-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1999

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 1999. Includes bibliographical references (leaves 62-64). By Rabih M. Zbib. S.M.

DSpace@MIT

Statistical Machine Translation Features with Multitask Tensor Networks

Author: Devlin Jacob
Huang Zhongqiang
Lamar Thomas
Makhoul John
Schwartz Richard
Setiawan Hendra
Zbib Rabih
Publication date: 01/01/2015

We present a three-pronged approach to improving Statistical Machine Translation (SMT), building on recent success in the application of neural networks to SMT. First, we propose new features based on neural networks to model various non-local translation phenomena. Second, we augment the architecture of the neural network with tensor layers that capture important higher-order interaction among the network units. Third, we apply multitask learning to estimate the neural network parameters jointly. Each of our proposed methods results in significant improvements that are complementary. The overall improvement is +2.7 and +1.8 BLEU points for Arabic-English and Chinese-English translation over a state-of-the-art system that already includes neural network features. Comment: 11 pages (9 content + 2 references), 2 figures, accepted to ACL 2015 as a long paper.

arXiv.org e-Print Archive

Crossref

Résumé Parsing as Hierarchical Sequence Labeling: An Empirical Study

Author: Aizpuru Juan
Fabregat Hermenegildo
Retyk Federico
Taglio Mariana
Zbib Rabih
Publication date: 13/09/2023

Extracting information from résumés is typically formulated as a two-stage problem, where the document is first segmented into sections and then each section is processed individually to extract the target entities. Instead, we cast the whole problem as sequence labeling in two levels (lines and tokens) and study model architectures for solving both tasks simultaneously. We build high-quality résumé parsing corpora in English, French, Chinese, Spanish, German, Portuguese, and Swedish. Based on these corpora, we present experimental results that demonstrate the effectiveness of the proposed models for the information extraction task, outperforming approaches introduced in previous work. We conduct an ablation study of the proposed architectures. We also analyze both model performance and resource efficiency, and describe the trade-offs for model deployment in the context of a production environment. Comment: RecSys in HR'23: The 3rd Workshop on Recommender Systems for Human Resources, in conjunction with the 17th ACM Conference on Recommender Systems, September 18-22, 2023, Singapore, Singapore.

arXiv.org e-Print Archive

Segmentation for English-to-Arabic statistical machine translation

Author: Ibrahim Badr
James Glass
Rabih Zbib
Publication date: 01/01/2008

In this paper, we report on a set of initial results for English-to-Arabic Statistical Machine Translation (SMT). We show that morphological decomposition of the Arabic source is beneficial, especially for smaller-size corpora, and investigate different recombination techniques. We also report on the use of Factored Translation Models for English-to-Arabic translation.

CiteSeerX

Crossref

Decision Trees for Lexical Smoothing in Statistical Machine Translation

Author: John Makhoul
Rabih Zbib
Richard Schwartz
Spyros Matsoukas
Publication date: 01/01/2010

We present a method for incorporating arbitrary context-informed word attributes into statistical machine translation by clustering attribute-qualified source words, and smoothing their word translation probabilities using binary decision trees. We describe two ways in which the decision trees are used in machine translation: by using the attribute-qualified source word clusters directly, or by using attribute-dependent lexical translation probabilities that are obtained from the trees, as a lexical smoothing feature in the decoder model. We present experiments using Arabic-to-English newswire data, and using Arabic diacritics and part-of-speech as source word attributes, and show that the proposed method improves on a state-of-the-art translation system.

CiteSeerX

Improved morphological decomposition for Arabic broadcast news transcription

Author: Ng Tim
Nguyen Kham
Nguyen Long
Zbib Rabih M.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/04/2009

In this paper, we show progress in Arabic speech recognition achieved by incorporating contextual information into the process of morphological decomposition. The new approach achieves lower out-of-vocabulary and word error rates when compared to our previous work, in which the morphological decomposition relies on word-level information only. We also describe how the vocalization procedure is improved to produce pronunciations for some dialectal Arabic words. By using the new approach, we reduced the word error rate by 0.8% absolute (4.7% relative) when compared to the baseline approach. United States. Defense Advanced Research Projects Agency (DARPA). GALE program (Contract No. HR0011-06-C-0022)
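The absolute and relative reductions quoted in the abstract above together imply an approximate baseline word error rate, since relative reduction = absolute reduction / baseline. The short sketch below works through that arithmetic. The function name is hypothetical, and the resulting figures are inferred from the two quoted numbers rather than reported in the abstract, so they are only as precise as the rounding of those numbers.

```python
def implied_baseline_wer(absolute_drop: float, relative_drop: float) -> float:
    """Baseline error rate implied by an absolute and a relative reduction.

    relative_drop = absolute_drop / baseline, so
    baseline = absolute_drop / relative_drop.
    """
    return absolute_drop / relative_drop


# Figures quoted in the abstract: 0.8% absolute, 4.7% relative.
baseline = implied_baseline_wer(0.8, 0.047)
improved = baseline - 0.8  # inferred, not a reported result

print(f"implied baseline WER ~ {baseline:.1f}%, implied new WER ~ {improved:.1f}%")
```

Under these assumptions the baseline comes out at roughly 17% and the improved system at roughly 16%.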

DSpace@MIT

Fast and Robust Neural Network Joint Models for Statistical Machine Translation

Author: Jacob Devlin
John Makhoul
Rabih Zbib
Richard Schwartz
Thomas Lamar
Zhongqiang Huang
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2014

Recent work has shown success in using neural network language models (NNLMs) as features in MT systems. Here, we present a novel formulation for a neural network joint model (NNJM), which augments the NNLM with a source context window. Our model is purely lexicalized and can be integrated into any MT decoder. We also present several variations of the NNJM which provide significant additive improvements. Although the model is quite simple, it yields strong empirical results. On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM. The NNJM features also produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang’s (2007) original Hiero implementation. Additionally, we describe two novel techniques for overcoming the historically high cost of using NNLM-style models in MT decoding. These techniques speed up NNJM computation by a factor of 10,000x, making the model as fast as a standard back-off LM. This work was supported by DARPA/I2O Contract No. HR0011-12-C-0014 under the BOLT program (Approved fo
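As an illustration of what "augmenting the NNLM with a source context window" could look like at the input level, the sketch below assembles the conditioning context for one target word: the preceding target words plus a window of source words around an aligned source position. Everything here is an assumption made for illustration: the function name, the padding tokens, the window sizes, and the idea that an aligned source index is supplied by the caller. The abstract does not specify these details, and this is not the authors' implementation.

```python
from typing import List


def nnjm_context(target: List[str], source: List[str],
                 t_idx: int, s_idx: int,
                 history: int = 3, half_window: int = 2) -> List[str]:
    """Build a joint context for predicting target[t_idx].

    The context is the `history` previous target words followed by the
    source words within `half_window` positions of source[s_idx].
    Padding tokens are hypothetical placeholders.
    """
    # Target-side history, left-padded with a start symbol.
    padded_tgt = ["<s>"] * history + target[:t_idx]
    tgt_hist = padded_tgt[len(padded_tgt) - history:]

    # Source-side window centred on the (assumed given) aligned position.
    padded_src = ["<pad>"] * half_window + source + ["<pad>"] * half_window
    centre = s_idx + half_window
    src_win = padded_src[centre - half_window: centre + half_window + 1]

    return tgt_hist + src_win


# Toy example with placeholder tokens (not real translation data).
context = nnjm_context(
    target=["t1", "t2", "t3"], source=["s1", "s2", "s3"],
    t_idx=1, s_idx=1, history=2, half_window=1,
)
print(context)
```

A plain target-only language model would use only the first part of this list; the source window is the extra conditioning information the abstract refers to.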

CiteSeerX

Crossref